Important Note: In case of execution issue due to “cannot open file” error, please download source files from Github folder A2_resources
Data cleaning and normalization have been performed on a RNAseq dataset produced by a recent paper titled Skeletal muscle transcriptome in healthy aging, published in Nature Communications in 2021. The raw data was obtained from GEO with the ID GSE164471 See the short bio below for some basic information of this dataset and its source. #### Dataset Meta Info Contact Location : 251 Bayview Blvd,Baltimore,USA
Contact Institute : NIA-IRP, NIH
Contact Department : Laboratory of Genetics and Genomics
Contact Laboratory : Computational Biology & Genomics Core
There are two supplemental files in this dataset:
1). Genes/RNAs (ENSG) — GSE164471_GESTALT_Muscle_ENSG_counts_annotated.csv
2). Transcripts/isoforms (ENST) — GSE164471_GESTALT_Muscle_ENST_TPM_annotated.csv
In this study, the authors defined ENSGs as genes/RNAs and ENSTs as transcripts/isoforms. For the purpose of our analysis, we will focus on RNA data.
## Warning: package 'circlize' was built under R version 4.1.2
Gene expression data is obtained from Skeletal muscle transcriptome in healthy aging[1].
Original raw data contains:
57773 protein-coding and non-coding RNAs across 53 healthy individuals (22-83 years old)
Filtered data with weakly expressed (low counts) genes removed contains
36084 protein-coding and a few non-coding RNAs across 53 healthy individuals (22-83 years old)
# RNAfilered <- read.table("https://github.com/bcb420-2022/Karen_Kuang/blob/main/A2_resources/RNAfiltered_mapped.txt", check.names = FALSE,sep = "\t")
# Unsure why this does not work. Messes up txt file content and format.
RNAfiltered <- read.table("RNAfiltered_mapped.txt", check.names = FALSE)
# RNAfiltered <- read.table("A2_resources/RNAfiltered_mapped.txt", check.names = FALSE)
# Column 2 ~ 54 represent individual study participant (the first column contains the ensemble IDs). BEWARE that the last 10 columns contain other info regarding RNAs in the Ensembl hg19 v82 (September 2015) database.
# colnames(RNAfile)
# Number of study participants we have : (53)
length(grep(colnames(RNAfiltered), pattern = "MUSCLE",ignore.case = TRUE))
## [1] 53
# Number of unique genes before cleaning: (57773)
# Number of unique genes after filtering: (36084)
length(RNAfiltered$ensembl_gene_id)
## [1] 36084
head(RNAfiltered)
## ensembl_gene_id hgnc_symbol
## TSPAN6 ENSG00000000003 TSPAN6
## TNMD ENSG00000000005 TNMD
## DPM1 ENSG00000000419 DPM1
## SCYL3 ENSG00000000457 SCYL3
## C1orf112 ENSG00000000460 C1orf112
## FGR ENSG00000000938 FGR
## MUSCLE_AGE22_M_GROUP20_34_CC_T415_Gt073_R1_COUNT
## TSPAN6 30
## TNMD 35
## DPM1 68
## SCYL3 191
## C1orf112 209
## FGR 65
## MUSCLE_AGE23_M_GROUP20_34_CC_T293_Gt022_R1_COUNT
## TSPAN6 57
## TNMD 14
## DPM1 302
## SCYL3 370
## C1orf112 430
## FGR 25
## MUSCLE_AGE25_F_GROUP20_34_CC_T430_Gt085_R1_COUNT
## TSPAN6 138
## TNMD 0
## DPM1 249
## SCYL3 587
## C1orf112 471
## FGR 23
## MUSCLE_AGE25_M_GROUP20_34_AS_T355_Gt046_R1_COUNT
## TSPAN6 81
## TNMD 2
## DPM1 262
## SCYL3 499
## C1orf112 353
## FGR 25
## MUSCLE_AGE26_F_GROUP20_34_CC_T370_Gt051_R1_COUNT
## TSPAN6 98
## TNMD 9
## DPM1 264
## SCYL3 449
## C1orf112 330
## FGR 49
## MUSCLE_AGE26_F_GROUP20_34_CC_T400_Gt045_R1_COUNT
## TSPAN6 192
## TNMD 11
## DPM1 308
## SCYL3 401
## C1orf112 324
## FGR 35
## MUSCLE_AGE27_F_GROUP20_34_CC_T385_Gt055_R1_COUNT
## TSPAN6 105
## TNMD 4
## DPM1 121
## SCYL3 293
## C1orf112 239
## FGR 38
## MUSCLE_AGE28_M_GROUP20_34_CC_T326_Gt001_R1_COUNT
## TSPAN6 398
## TNMD 24
## DPM1 642
## SCYL3 1251
## C1orf112 934
## FGR 8
## MUSCLE_AGE31_F_GROUP20_34_AS_T640_Gt090_R1_COUNT
## TSPAN6 166
## TNMD 7
## DPM1 204
## SCYL3 574
## C1orf112 397
## FGR 2
## MUSCLE_AGE31_M_GROUP20_34_AA_T230_Gt025_R1_COUNT
## TSPAN6 117
## TNMD 4
## DPM1 249
## SCYL3 329
## C1orf112 251
## FGR 17
## MUSCLE_AGE31_M_GROUP20_34_AA_T328_Gt005_R1_COUNT
## TSPAN6 122
## TNMD 269
## DPM1 302
## SCYL3 463
## C1orf112 430
## FGR 12
## MUSCLE_AGE33_M_GROUP20_34_CC_T340_Gt042_R1_COUNT
## TSPAN6 198
## TNMD 6
## DPM1 160
## SCYL3 500
## C1orf112 422
## FGR 57
## MUSCLE_AGE35_M_GROUP35_49_CC_T670_Gt104_R1_COUNT
## TSPAN6 44
## TNMD 6
## DPM1 153
## SCYL3 285
## C1orf112 256
## FGR 111
## MUSCLE_AGE37_F_GROUP35_49_AA_T120_Gt037_R1_COUNT
## TSPAN6 156
## TNMD 58
## DPM1 197
## SCYL3 365
## C1orf112 305
## FGR 19
## MUSCLE_AGE37_M_GROUP35_49_AA_T460_Gt064_R1_COUNT
## TSPAN6 78
## TNMD 1
## DPM1 212
## SCYL3 320
## C1orf112 224
## FGR 23
## MUSCLE_AGE38_M_GROUP35_49_CC_T232_Gt016_R1_COUNT
## TSPAN6 113
## TNMD 3
## DPM1 260
## SCYL3 378
## C1orf112 364
## FGR 22
## MUSCLE_AGE38_M_GROUP35_49_CC_T743_Gt093_R1_COUNT
## TSPAN6 121
## TNMD 14
## DPM1 251
## SCYL3 685
## C1orf112 373
## FGR 25
## MUSCLE_AGE42_F_GROUP35_49_CC_T445_Gt059_R1_COUNT
## TSPAN6 75
## TNMD 12
## DPM1 269
## SCYL3 679
## C1orf112 504
## FGR 35
## MUSCLE_AGE42_M_GROUP35_49_AA_T297_Gt024_R1_COUNT
## TSPAN6 133
## TNMD 0
## DPM1 660
## SCYL3 807
## C1orf112 711
## FGR 14
## MUSCLE_AGE45_M_GROUP35_49_AA_T292_Gt030_R1_COUNT
## TSPAN6 138
## TNMD 10
## DPM1 302
## SCYL3 600
## C1orf112 468
## FGR 16
## MUSCLE_AGE45_M_GROUP35_49_AA_T325_Gt017_R1_COUNT
## TSPAN6 130
## TNMD 4
## DPM1 220
## SCYL3 453
## C1orf112 341
## FGR 2
## MUSCLE_AGE47_F_GROUP35_49_CC_T490_Gt078_R1_COUNT
## TSPAN6 122
## TNMD 12
## DPM1 188
## SCYL3 481
## C1orf112 431
## FGR 25
## MUSCLE_AGE47_M_GROUP35_49_AA_T475_Gt060_R1_COUNT
## TSPAN6 146
## TNMD 2
## DPM1 297
## SCYL3 484
## C1orf112 337
## FGR 38
## MUSCLE_AGE51_M_GROUP50_64_CC_T289_Gt015_R1_COUNT
## TSPAN6 124
## TNMD 5
## DPM1 338
## SCYL3 617
## C1orf112 524
## FGR 13
## MUSCLE_AGE52_F_GROUP50_64_AA_T535_Gt063_R1_COUNT
## TSPAN6 69
## TNMD 72
## DPM1 68
## SCYL3 336
## C1orf112 240
## FGR 27
## MUSCLE_AGE52_F_GROUP50_64_CC_T331_Gt027_R1_COUNT
## TSPAN6 84
## TNMD 0
## DPM1 229
## SCYL3 237
## C1orf112 207
## FGR 25
## MUSCLE_AGE52_M_GROUP50_64_AA_T505_Gt056_R1_COUNT
## TSPAN6 152
## TNMD 39
## DPM1 418
## SCYL3 965
## C1orf112 759
## FGR 78
## MUSCLE_AGE52_M_GROUP50_64_CC_T329_Gt007_R1_COUNT
## TSPAN6 58
## TNMD 6
## DPM1 271
## SCYL3 321
## C1orf112 251
## FGR 5
## MUSCLE_AGE54_M_GROUP50_64_AA_T520_Gt057_R1_COUNT
## TSPAN6 67
## TNMD 3
## DPM1 344
## SCYL3 484
## C1orf112 428
## FGR 40
## MUSCLE_AGE57_M_GROUP50_64_CC_T266_Gt004_R1_COUNT
## TSPAN6 107
## TNMD 9
## DPM1 395
## SCYL3 772
## C1orf112 632
## FGR 11
## MUSCLE_AGE58_F_GROUP50_64_CC_T330_Gt011_R1_COUNT
## TSPAN6 206
## TNMD 15
## DPM1 198
## SCYL3 584
## C1orf112 489
## FGR 104
## MUSCLE_AGE60_M_GROUP50_64_CC_T300_Gt013_R1_COUNT
## TSPAN6 263
## TNMD 7
## DPM1 329
## SCYL3 494
## C1orf112 371
## FGR 9
## MUSCLE_AGE62_F_GROUP50_64_CC_T550_Gt070_R1_COUNT
## TSPAN6 104
## TNMD 6
## DPM1 373
## SCYL3 968
## C1orf112 651
## FGR 42
## MUSCLE_AGE63_F_GROUP50_64_CC_T676_Gt101_R1_COUNT
## TSPAN6 111
## TNMD 5
## DPM1 188
## SCYL3 355
## C1orf112 256
## FGR 62
## MUSCLE_AGE63_M_GROUP50_64_CC_T685_Gt092_R1_COUNT
## TSPAN6 111
## TNMD 102
## DPM1 236
## SCYL3 508
## C1orf112 447
## FGR 30
## MUSCLE_AGE67_F_GROUP65_79_CC_T610_Gt080_R1_COUNT
## TSPAN6 99
## TNMD 6
## DPM1 254
## SCYL3 548
## C1orf112 414
## FGR 35
## MUSCLE_AGE67_M_GROUP65_79_AA_T295_Gt039_R1_COUNT
## TSPAN6 150
## TNMD 29
## DPM1 342
## SCYL3 434
## C1orf112 375
## FGR 7
## MUSCLE_AGE67_M_GROUP65_79_CC_T228_Gt002_R1_COUNT
## TSPAN6 122
## TNMD 19
## DPM1 375
## SCYL3 686
## C1orf112 644
## FGR 46
## MUSCLE_AGE69_F_GROUP65_79_CC_T715_Gt095_R1_COUNT
## TSPAN6 108
## TNMD 7
## DPM1 172
## SCYL3 321
## C1orf112 344
## FGR 41
## MUSCLE_AGE70_M_GROUP65_79_AS_T625_Gt089_R1_COUNT
## TSPAN6 80
## TNMD 6
## DPM1 221
## SCYL3 519
## C1orf112 444
## FGR 39
## MUSCLE_AGE70_M_GROUP65_79_CC_T595_Gt061_R1_COUNT
## TSPAN6 101
## TNMD 6
## DPM1 247
## SCYL3 456
## C1orf112 410
## FGR 43
## MUSCLE_AGE72_F_GROUP65_79_CC_T296_Gt018_R1_COUNT
## TSPAN6 114
## TNMD 1
## DPM1 363
## SCYL3 417
## C1orf112 522
## FGR 11
## MUSCLE_AGE72_F_GROUP65_79_CC_T580_Gt071_R1_COUNT
## TSPAN6 122
## TNMD 7
## DPM1 246
## SCYL3 468
## C1orf112 360
## FGR 29
## MUSCLE_AGE72_M_GROUP65_79_CC_T298_Gt014_R1_COUNT
## TSPAN6 112
## TNMD 5
## DPM1 247
## SCYL3 320
## C1orf112 295
## FGR 30
## MUSCLE_AGE72_M_GROUP65_79_CC_T730_Gt084_R1_COUNT
## TSPAN6 129
## TNMD 2
## DPM1 213
## SCYL3 371
## C1orf112 282
## FGR 37
## MUSCLE_AGE72_M_GROUP65_79_CC_T760_Gt109_R1_COUNT
## TSPAN6 106
## TNMD 18
## DPM1 255
## SCYL3 554
## C1orf112 399
## FGR 64
## MUSCLE_AGE73_M_GROUP65_79_CC_T565_Gt065_R1_COUNT
## TSPAN6 66
## TNMD 1
## DPM1 179
## SCYL3 516
## C1orf112 383
## FGR 42
## MUSCLE_AGE76_M_GROUP65_79_CC_T745_Gt098_R1_COUNT
## TSPAN6 52
## TNMD 25
## DPM1 109
## SCYL3 426
## C1orf112 278
## FGR 51
## MUSCLE_AGE80_F_GROUP80_PLUS_CC_T290_Gt032_R1_COUNT
## TSPAN6 110
## TNMD 51
## DPM1 40
## SCYL3 206
## C1orf112 219
## FGR 39
## MUSCLE_AGE81_M_GROUP80_PLUS_CC_T291_Gt029_R1_COUNT
## TSPAN6 93
## TNMD 6
## DPM1 335
## SCYL3 457
## C1orf112 436
## FGR 6
## MUSCLE_AGE81_M_GROUP80_PLUS_CC_T790_Gt110_R1_COUNT
## TSPAN6 40
## TNMD 8
## DPM1 111
## SCYL3 363
## C1orf112 262
## FGR 58
## MUSCLE_AGE81_M_GROUP80_PLUS_CC_T805_Gt108_R1_COUNT
## TSPAN6 52
## TNMD 18
## DPM1 117
## SCYL3 314
## C1orf112 212
## FGR 29
## MUSCLE_AGE83_M_GROUP80_PLUS_CC_T294_Gt026_R1_COUNT HG19en82 Chromosome
## TSPAN6 117 chrX
## TNMD 15 chrX
## DPM1 345 chr20
## SCYL3 745 chr1
## C1orf112 636 chr1
## FGR 4 chr1
## HG19en82 Gene Start (bp) HG19en82 Gene End (bp) HG19en82 Gene Length
## TSPAN6 99883667 99894988 2968
## TNMD 99839799 99854882 1610
## DPM1 49551404 49575092 1207
## SCYL3 169818772 169863408 6876
## C1orf112 169631245 169823221 6354
## FGR 27938575 27961788 3474
## HG19en82 Strand HG19en82 % GC Content HG19en82 Cytoband
## TSPAN6 - 40.87 q22.1
## TNMD + 40.80 q22.1
## DPM1 - 39.85 q13.13
## SCYL3 - 40.14 q24.2
## C1orf112 + 39.22 q24.2
## FGR - 52.97 p36.11
## HG19en82 Gene Biotype HG19en82 Gene Name
## TSPAN6 protein_coding TSPAN6
## TNMD protein_coding TNMD
## DPM1 protein_coding DPM1
## SCYL3 protein_coding SCYL3
## C1orf112 protein_coding C1orf112
## FGR protein_coding FGR
## HG19en82 Description
## TSPAN6 tetraspanin 6 [Source:HGNC Symbol;Acc:11858]
## TNMD tenomodulin [Source:HGNC Symbol;Acc:17757]
## DPM1 dolichyl-phosphate mannosyltransferase polypeptide 1, catalytic subunit [Source:HGNC Symbol;Acc:3005]
## SCYL3 SCY1-like 3 (S. cerevisiae) [Source:HGNC Symbol;Acc:19285]
## C1orf112 chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:25565]
## FGR feline Gardner-Rasheed sarcoma viral oncogene homolog [Source:HGNC Symbol;Acc:3697]
# Load normalized count data for filtered genes. The trimmed mean method (TMM) was used for normalization.
# Restore proper formatting from text file
norm_count <- read.table("normalized_counts.txt", check.names = FALSE)
# norm_count <- read.table("A2_resources/normalized_counts.txt", check.names = FALSE)
colnames(norm_count) <- norm_count[1, 1:ncol(norm_count)]
rownames(norm_count) <- norm_count[1:nrow(norm_count), 1]
norm_count <- norm_count[-1,-1]
head(norm_count)
## MUSCLE_AGE22_M_GROUP20_34_CC_T415_Gt073_R1_COUNT
## TSPAN6 0.495633004144666
## TNMD 0.578238504835443
## DPM1 1.12343480939458
## SCYL3 3.15553012638771
## C1orf112 3.4529099288745
## FGR 1.07387150898011
## MUSCLE_AGE23_M_GROUP20_34_CC_T293_Gt022_R1_COUNT
## TSPAN6 1.08734841704813
## TNMD 0.267068032257436
## DPM1 5.76103898155326
## SCYL3 7.05822656680366
## C1orf112 8.20280384790695
## FGR 0.476907200459707
## MUSCLE_AGE25_F_GROUP20_34_CC_T430_Gt085_R1_COUNT
## TSPAN6 2.52854656354576
## TNMD 0
## DPM1 4.56237749509345
## SCYL3 10.7554842956621
## C1orf112 8.6300393581888
## FGR 0.421424427257627
## MUSCLE_AGE25_M_GROUP20_34_AS_T355_Gt046_R1_COUNT
## TSPAN6 1.66130477843066
## TNMD 0.041019871072362
## DPM1 5.37360311047942
## SCYL3 10.2344578325543
## C1orf112 7.24000724427189
## FGR 0.512748388404525
## MUSCLE_AGE26_F_GROUP20_34_CC_T370_Gt051_R1_COUNT
## TSPAN6 1.48630884670204
## TNMD 0.136497751227738
## DPM1 4.00393403601365
## SCYL3 6.80972114458382
## C1orf112 5.00491754501706
## FGR 0.743154423351019
## MUSCLE_AGE26_F_GROUP20_34_CC_T400_Gt045_R1_COUNT
## TSPAN6 3.39248571086676
## TNMD 0.194361160518408
## DPM1 5.44211249451543
## SCYL3 7.08534776071652
## C1orf112 5.72481963708766
## FGR 0.618421874376753
## MUSCLE_AGE27_F_GROUP20_34_CC_T385_Gt055_R1_COUNT
## TSPAN6 2.43624428127776
## TNMD 0.0928093059534386
## DPM1 2.80748150509152
## SCYL3 6.79828166108938
## C1orf112 5.54535603071796
## FGR 0.881688406557667
## MUSCLE_AGE28_M_GROUP20_34_CC_T326_Gt001_R1_COUNT
## TSPAN6 4.58056938804357
## TNMD 0.276215239480014
## DPM1 7.38875765609038
## SCYL3 14.3977193578957
## C1orf112 10.7493764030972
## FGR 0.0920717464933381
## MUSCLE_AGE31_F_GROUP20_34_AS_T640_Gt090_R1_COUNT
## TSPAN6 2.97104297440637
## TNMD 0.125284944703883
## DPM1 3.65116124565602
## SCYL3 10.2733654657184
## C1orf112 7.10544614963452
## FGR 0.0357956984868238
## MUSCLE_AGE31_M_GROUP20_34_AA_T230_Gt025_R1_COUNT
## TSPAN6 3.70045713497259
## TNMD 0.126511355041798
## DPM1 7.87533185135192
## SCYL3 10.4055589521879
## C1orf112 7.93858752887282
## FGR 0.537673258927641
## MUSCLE_AGE31_M_GROUP20_34_AA_T328_Gt005_R1_COUNT
## TSPAN6 2.02414123204382
## TNMD 4.46306550344088
## DPM1 5.01057911538716
## SCYL3 7.68178188882204
## C1orf112 7.13426827687576
## FGR 0.199095858889556
## MUSCLE_AGE33_M_GROUP20_34_CC_T340_Gt042_R1_COUNT
## TSPAN6 3.04293369487199
## TNMD 0.092210111965818
## DPM1 2.45893631908848
## SCYL3 7.6841759971515
## C1orf112 6.48544454159586
## FGR 0.875996063675271
## MUSCLE_AGE35_M_GROUP35_49_CC_T670_Gt104_R1_COUNT
## TSPAN6 0.69301546223257
## TNMD 0.0945021084862595
## DPM1 2.40980376639962
## SCYL3 4.48885015309733
## C1orf112 4.03208996208041
## FGR 1.7482890069958
## MUSCLE_AGE37_F_GROUP35_49_AA_T120_Gt037_R1_COUNT
## TSPAN6 2.85926366961664
## TNMD 1.06305956947285
## DPM1 3.61073681355434
## SCYL3 6.68994384237226
## C1orf112 5.59022704636586
## FGR 0.348243652068693
## MUSCLE_AGE37_M_GROUP35_49_AA_T460_Gt064_R1_COUNT
## TSPAN6 1.72294853327264
## TNMD 0.0220890837599056
## DPM1 4.6828857571
## SCYL3 7.06850680316981
## C1orf112 4.94795476221887
## FGR 0.50804892647783
## MUSCLE_AGE38_M_GROUP35_49_CC_T232_Gt016_R1_COUNT
## TSPAN6 2.93231394109958
## TNMD 0.0778490426840596
## DPM1 6.7469170326185
## SCYL3 9.80897937819151
## C1orf112 9.4456838456659
## FGR 0.570892979683104
## MUSCLE_AGE38_M_GROUP35_49_CC_T743_Gt093_R1_COUNT
## TSPAN6 1.65781066052541
## TNMD 0.191812803697155
## DPM1 3.43892955199899
## SCYL3 9.38512646661079
## C1orf112 5.11044112707419
## FGR 0.342522863744919
## MUSCLE_AGE42_F_GROUP35_49_CC_T445_Gt059_R1_COUNT
## TSPAN6 1.21152253824014
## TNMD 0.193843606118422
## DPM1 4.3453275038213
## SCYL3 10.9683173795341
## C1orf112 8.14143145697374
## FGR 0.565377184512065
## MUSCLE_AGE42_M_GROUP35_49_AA_T297_Gt024_R1_COUNT
## TSPAN6 1.93664808282571
## TNMD 0
## DPM1 9.61043409522533
## SCYL3 11.7509398709801
## C1orf112 10.3530585480382
## FGR 0.203857692929022
## MUSCLE_AGE45_M_GROUP35_49_AA_T292_Gt030_R1_COUNT
## TSPAN6 2.04923007325138
## TNMD 0.148494932844303
## DPM1 4.48454697189795
## SCYL3 8.90969597065817
## C1orf112 6.94956285711338
## FGR 0.237591892550885
## MUSCLE_AGE45_M_GROUP35_49_AA_T325_Gt017_R1_COUNT
## TSPAN6 3.91721639905587
## TNMD 0.120529735355565
## DPM1 6.62913544455609
## SCYL3 13.6499925290178
## C1orf112 10.2751599390619
## FGR 0.0602648676777826
## MUSCLE_AGE47_F_GROUP35_49_CC_T490_Gt078_R1_COUNT
## TSPAN6 2.08780115679835
## TNMD 0.205357490832625
## DPM1 3.21726735637779
## SCYL3 8.23141275754104
## C1orf112 7.37575654573843
## FGR 0.427828105901301
## MUSCLE_AGE47_M_GROUP35_49_AA_T475_Gt060_R1_COUNT
## TSPAN6 2.26094084644405
## TNMD 0.0309717924170417
## DPM1 4.5993111739307
## SCYL3 7.4951737649241
## C1orf112 5.21874702227153
## FGR 0.588464055923793
## MUSCLE_AGE51_M_GROUP50_64_CC_T289_Gt015_R1_COUNT
## TSPAN6 2.44554975221804
## TNMD 0.098610877105566
## DPM1 6.66609529233626
## SCYL3 12.1685822348269
## C1orf112 10.3344199206633
## FGR 0.256388280474472
## MUSCLE_AGE52_F_GROUP50_64_AA_T535_Gt063_R1_COUNT
## TSPAN6 1.8385794802239
## TNMD 1.91851771849451
## DPM1 1.81193340080037
## SCYL3 8.95308268630771
## C1orf112 6.39505906164836
## FGR 0.719444144435441
## MUSCLE_AGE52_F_GROUP50_64_CC_T331_Gt027_R1_COUNT
## TSPAN6 2.2635487463469
## TNMD 0
## DPM1 6.17086503468381
## SCYL3 6.38644110576447
## C1orf112 5.578030839212
## FGR 0.673675222127054
## MUSCLE_AGE52_M_GROUP50_64_AA_T505_Gt056_R1_COUNT
## TSPAN6 1.66829155323219
## TNMD 0.428048490631943
## DPM1 4.58780177138851
## SCYL3 10.5914562425596
## C1orf112 8.33048216383704
## FGR 0.856096981263885
## MUSCLE_AGE52_M_GROUP50_64_CC_T329_Gt007_R1_COUNT
## TSPAN6 1.75026757200975
## TNMD 0.181062162621698
## DPM1 8.17797434508003
## SCYL3 9.68682570026085
## C1orf112 7.5744338030077
## FGR 0.150885135518082
## MUSCLE_AGE54_M_GROUP50_64_AA_T520_Gt057_R1_COUNT
## TSPAN6 0.814397013907725
## TNMD 0.0364655379361668
## DPM1 4.18138168334713
## SCYL3 5.88310678703491
## C1orf112 5.2024167455598
## FGR 0.486207172482224
## MUSCLE_AGE57_M_GROUP50_64_CC_T266_Gt004_R1_COUNT
## TSPAN6 1.90107736665636
## TNMD 0.159903703737451
## DPM1 7.01799588625478
## SCYL3 13.7161843650347
## C1orf112 11.2287934180077
## FGR 0.195437860123551
## MUSCLE_AGE58_F_GROUP50_64_CC_T330_Gt011_R1_COUNT
## TSPAN6 3.74986148520786
## TNMD 0.27304816639863
## DPM1 3.60423579646192
## SCYL3 10.6306752784533
## C1orf112 8.90137022459535
## FGR 1.89313395369717
## MUSCLE_AGE60_M_GROUP50_64_CC_T300_Gt013_R1_COUNT
## TSPAN6 3.36262730901047
## TNMD 0.0894995861713813
## DPM1 4.20648055005492
## SCYL3 6.31611365266605
## C1orf112 4.74347806708321
## FGR 0.115070896506062
## MUSCLE_AGE62_F_GROUP50_64_CC_T550_Gt070_R1_COUNT
## TSPAN6 1.34756987621238
## TNMD 0.0777444159353297
## DPM1 4.83311119064633
## SCYL3 12.5427657708999
## C1orf112 8.43526912898327
## FGR 0.544210911547308
## MUSCLE_AGE63_F_GROUP50_64_CC_T676_Gt101_R1_COUNT
## TSPAN6 1.88087653285919
## TNMD 0.0847241681468106
## DPM1 3.18562872232008
## SCYL3 6.01541593842355
## C1orf112 4.3378774091167
## FGR 1.05057968502045
## MUSCLE_AGE63_M_GROUP50_64_CC_T685_Gt092_R1_COUNT
## TSPAN6 1.61913463112672
## TNMD 1.48785344481915
## DPM1 3.44248444095411
## SCYL3 7.4100936271385
## C1orf112 6.52029891994274
## FGR 0.437603954358573
## MUSCLE_AGE67_F_GROUP65_79_CC_T610_Gt080_R1_COUNT
## TSPAN6 1.62612528375301
## TNMD 0.0985530475001827
## DPM1 4.17207901084107
## SCYL3 9.00117833835002
## C1orf112 6.80016027751261
## FGR 0.574892777084399
## MUSCLE_AGE67_M_GROUP65_79_AA_T295_Gt039_R1_COUNT
## TSPAN6 2.2023732043387
## TNMD 0.425792152838816
## DPM1 5.02141090589224
## SCYL3 6.37219980455331
## C1orf112 5.50593301084676
## FGR 0.102777416202473
## MUSCLE_AGE67_M_GROUP65_79_CC_T228_Gt002_R1_COUNT
## TSPAN6 2.24302570890385
## TNMD 0.34932367597683
## DPM1 6.8945462363848
## SCYL3 12.6124232484266
## C1orf112 11.8402340699515
## FGR 0.845731004996535
## MUSCLE_AGE69_F_GROUP65_79_CC_T715_Gt095_R1_COUNT
## TSPAN6 2.1753375088858
## TNMD 0.140994097798154
## DPM1 3.46442640304035
## SCYL3 6.46558648474392
## C1orf112 6.92885280608071
## FGR 0.825822572817759
## MUSCLE_AGE70_M_GROUP65_79_AS_T625_Gt089_R1_COUNT
## TSPAN6 1.47235523978391
## TNMD 0.110426642983793
## DPM1 4.06738134990306
## SCYL3 9.55190461809813
## C1orf112 8.17157158080071
## FGR 0.717773179394657
## MUSCLE_AGE70_M_GROUP65_79_CC_T595_Gt061_R1_COUNT
## TSPAN6 2.09849085464065
## TNMD 0.12466282304796
## DPM1 5.131952882141
## SCYL3 9.47437455164493
## C1orf112 8.51862624161057
## FGR 0.893416898510377
## MUSCLE_AGE72_F_GROUP65_79_CC_T296_Gt018_R1_COUNT
## TSPAN6 2.84101672473088
## TNMD 0.0249211993397446
## DPM1 9.04639536032728
## SCYL3 10.3921401246735
## C1orf112 13.0088660553467
## FGR 0.27413319273719
## MUSCLE_AGE72_F_GROUP65_79_CC_T580_Gt071_R1_COUNT
## TSPAN6 2.39477254491478
## TNMD 0.137404982085274
## DPM1 4.82880365613965
## SCYL3 9.18650451655835
## C1orf112 7.06654193581412
## FGR 0.569249211496137
## MUSCLE_AGE72_M_GROUP65_79_CC_T298_Gt014_R1_COUNT
## TSPAN6 2.88336976944442
## TNMD 0.12872186470734
## DPM1 6.35886011654261
## SCYL3 8.23819934126978
## C1orf112 7.59459001773307
## FGR 0.772331188244042
## MUSCLE_AGE72_M_GROUP65_79_CC_T730_Gt084_R1_COUNT
## TSPAN6 2.12175820816641
## TNMD 0.0328954760956033
## DPM1 3.50336820418175
## SCYL3 6.10211081573441
## C1orf112 4.63826212948006
## FGR 0.60856630776866
## MUSCLE_AGE72_M_GROUP65_79_CC_T760_Gt109_R1_COUNT
## TSPAN6 1.48049087054646
## TNMD 0.251404110092795
## DPM1 3.5615582263146
## SCYL3 7.73765983285603
## C1orf112 5.57279110705696
## FGR 0.893881280329938
## MUSCLE_AGE73_M_GROUP65_79_CC_T565_Gt065_R1_COUNT
## TSPAN6 1.1221350455523
## TNMD 0.0170020461447319
## DPM1 3.04336625990701
## SCYL3 8.77305581068165
## C1orf112 6.51178367343231
## FGR 0.714085938078739
## MUSCLE_AGE76_M_GROUP65_79_CC_T745_Gt098_R1_COUNT
## TSPAN6 0.914941886947357
## TNMD 0.439875907186229
## DPM1 1.91785895533196
## SCYL3 7.49548545845335
## C1orf112 4.89142008791087
## FGR 0.897346850659908
## MUSCLE_AGE80_F_GROUP80_PLUS_CC_T290_Gt032_R1_COUNT
## TSPAN6 1.98110467083993
## TNMD 0.918512165571242
## DPM1 0.720401698487249
## SCYL3 3.71006874720933
## C1orf112 3.94419929921769
## FGR 0.702391656025068
## MUSCLE_AGE81_M_GROUP80_PLUS_CC_T291_Gt029_R1_COUNT
## TSPAN6 2.25340931993861
## TNMD 0.145381246447652
## DPM1 8.11711959332727
## SCYL3 11.0732049377629
## C1orf112 10.5643705751961
## FGR 0.145381246447652
## MUSCLE_AGE81_M_GROUP80_PLUS_CC_T790_Gt110_R1_COUNT
## TSPAN6 0.927187455996697
## TNMD 0.185437491199339
## DPM1 2.57294519039084
## SCYL3 8.41422616317003
## C1orf112 6.07307783677837
## FGR 1.34442181119521
## MUSCLE_AGE81_M_GROUP80_PLUS_CC_T805_Gt108_R1_COUNT
## TSPAN6 1.31350438923174
## TNMD 0.454674596272524
## DPM1 2.95538487577141
## SCYL3 7.93154573497626
## C1orf112 5.35505635609862
## FGR 0.732531293994623
## MUSCLE_AGE83_M_GROUP80_PLUS_CC_T294_Gt026_R1_COUNT
## TSPAN6 1.95716825454737
## TNMD 0.250919006993253
## DPM1 5.77113716084482
## SCYL3 12.4623106806649
## C1orf112 10.6389658965139
## FGR 0.0669117351982008
# First define the grouping of samples:
grouping <-data.frame(lapply(colnames(norm_count),
FUN=function(x){unlist(strsplit(x, split = "_"))[c(3,4)]}))
colnames(grouping) <- colnames(norm_count)
rownames(grouping) <- c("sex","age")
grouping <- data.frame(t(grouping))
filtered_RNA_matrix <- as.matrix(RNAfiltered[3:55])
filtered_RNA_matrix[1:3,1:3]
## MUSCLE_AGE22_M_GROUP20_34_CC_T415_Gt073_R1_COUNT
## TSPAN6 30
## TNMD 35
## DPM1 68
## MUSCLE_AGE23_M_GROUP20_34_CC_T293_Gt022_R1_COUNT
## TSPAN6 57
## TNMD 14
## DPM1 302
## MUSCLE_AGE25_F_GROUP20_34_CC_T430_Gt085_R1_COUNT
## TSPAN6 138
## TNMD 0
## DPM1 249
# Create the DEGList object and calculate normalization factors:
d = DGEList(counts=filtered_RNA_matrix, group=grouping$age)
d = calcNormFactors(d, method = "TMM") #trimmed mean method
short <- sapply(strsplit(colnames(RNAfiltered)[2:54], split = "_"),
function(x) paste(x[4], collapse = '_'))
# No clear cluster by age group...
plotMDS(data.matrix(norm_count), labels=short, cex=0.6,
col=c("blue", "green", "pink", "red", "black")[factor(grouping$age)])
title(main = "MDS plot -- Group by Age")
legend("topleft", legend=c("Age 20~34", "Age 35~49", "Age 50~64", "Age 65~79", "Age > 80"), fill=c("blue", "green", "pink", "red", "black"), cex=0.8)
# No clustering by sex either...
plotMDS(data.matrix(norm_count), labels=short, cex=0.8,
col=c("pink", "blue")[factor(grouping$sex)])
title(main = "MDS plot --- Group by Sex")
legend("topleft", legend=c("Female", "Male"), fill=c("pink", "blue"), cex=0.8)
# hmm.., sex and age are the only covariates used in this experiment, so this RNA-seq data must be very dispersed
### GENERALIZED LINEAR MODEL using Age & Sex
model_design <- model.matrix(~grouping$age+grouping$sex+0)
kable(model_design[1:5, 1:6], type = "html") %>% kableExtra::row_spec(0, angle=-20)
| grouping\(ageGROUP20 </th> <th style="text-align:right;-webkit-transform: rotate(-20deg); -moz-transform: rotate(-20deg); -ms-transform: rotate(-20deg); -o-transform: rotate(-20deg); transform: rotate(-20deg);"> grouping\)ageGROUP35 | grouping\(ageGROUP50 </th> <th style="text-align:right;-webkit-transform: rotate(-20deg); -moz-transform: rotate(-20deg); -ms-transform: rotate(-20deg); -o-transform: rotate(-20deg); transform: rotate(-20deg);"> grouping\)ageGROUP65 | grouping\(ageGROUP80 </th> <th style="text-align:right;-webkit-transform: rotate(-20deg); -moz-transform: rotate(-20deg); -ms-transform: rotate(-20deg); -o-transform: rotate(-20deg); transform: rotate(-20deg);"> grouping\)sexM | |||
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 |
# Estimate dispersion according to model design
if(!exists('est_d')){
est_d <- estimateDisp(d, model_design)
}
# Fit the model to data
fit <- glmQLFit(est_d, model_design)
# Use the Genewise Negative Binomial Generalized Linear Models with Quasi-likelihood Tests from edgeR to calculate differential expression of bulk RNA-seq data
qlf_output <- glmQLFTest(fit)
# Benjamini Hochberg Multiple Testing Correction is applied by function topTags()
qlf_output_hits <- topTags(qlf_output,
adjust.method = "BH",
sort.by = "PValue",
n = nrow(norm_count))
qlf_output_hits$table[1:5,]
## logFC logCPM F PValue FDR
## ZFY 6.197065 1.9294452 276.6287 1.362372e-23 2.590671e-19
## TTTY14 8.983947 1.4749536 276.0184 1.435911e-23 2.590671e-19
## HSFY2 8.520869 -0.4014364 240.8904 3.527209e-22 4.242527e-18
## EIF1AY 9.526244 3.0596327 235.4057 6.015401e-22 5.426493e-18
## KDM5D 7.188870 3.8722766 216.3795 4.156852e-21 2.999917e-17
### False Discovery Rate (FDR) p-Value Correction:
# Note: Benjamini-Hochberg (aka "fdr") method is used here to control the false discovery rate, meaning the expected proportion of false discoveries among rejected hypotheses.
plot(qlf_output_hits$table$logFC,-log10(qlf_output_hits$table$FDR),xlab="Log Fold Change (logFC)", ylab="-log10(adj p-val)",
main="FDR Adjusted p-Values from Linear Regressions")
legend("topleft",c("p=0.05","p=0.01"),col=c("red","blue"),
lty=1:1,cex=0.8)
abline(1.3,0,col="red")
abline(2,0,col="blue")
abline(v=0,col="black")
# Genes that passed the threshold p-value < 0.05: 2292
# Genes that passed the threshold p-value < 0.01: 899
# Genes up-regulated/over-represented with age, under p-value < 0.01: 578
# Genes down-regulated/under-represented with age, under p-value < 0.01: 321
# Genes that passed the BH multiple testing correction: 397
# 2292 genes passed the threshold p-value < 0.05
glm_total_0.05 <- length(which(qlf_output_hits$table$PValue < 0.05))
# 899 genes passed the threshold p-value < 0.01
glm_total_0.01 <- length(which(qlf_output_hits$table$PValue < 0.01))
# 397 genes passed BH multiple testing correction
glm_corrected <- length(which(qlf_output_hits$table$FDR < 0.05))
# Rank output hits by +/- log fold change
qlf_output_ranked <- qlf_output_hits$table
qlf_output_ranked[, "rank"] <-
-log(qlf_output_ranked$PValue, base = 10) * sign(qlf_output_ranked$logFC)
qlf_output_ranked <- qlf_output_ranked[order(qlf_output_ranked$rank),]
# We will proceed with gene lists thresholded by P=0.01 to be more stringent with our selection:
# Under p-value < 0.01, there are 578 up-regulated/over-represented genes
glm_over <- rownames(qlf_output_ranked)[which(qlf_output_ranked$PValue < 0.01
& qlf_output_ranked$logFC > 0)]
# Under p-value < 0.01, there are 321 down-regulated/under-represented genes
glm_under <- rownames(qlf_output_ranked)[which(qlf_output_ranked$PValue < 0.01
& qlf_output_ranked$logFC < 0)]
glm_total <- rownames(qlf_output_ranked)[which(qlf_output_ranked$PValue < 0.01)]
write.table(glm_total,
file = file.path("/Users/apple/bcb420_code/DE_data","total_byGLM.txt"),sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
# Write analyzed gene list as text files
write.table(glm_over,
file = file.path("/Users/apple/bcb420_code/DE_data","skeletalMuscle_upregulated_genes_byGLM.txt"), sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(glm_under,
file = file.path("/Users/apple/bcb420_code/DE_data","skeletalMuscle_downregulated_genes_byGLM.txt"),sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
### GENERALIZED LINEAR Model volcano plot
glin.info <- data.frame(PValue = qlf_output$table$PValue, logFC = qlf_output_ranked$logFC)
# head(lin.info)
glin.info$threshold <- factor(ifelse(glin.info$PValue<0.01 & glin.info$logFC < 0, 1,
ifelse(glin.info$PValue<0.01 & glin.info$logFC > 0,-1,0)))
options(scipen=999) # turn-off scientific notation like 1e+48
glinear_Vplot <- ggplot(glin.info, aes(x=logFC, y=-log10(PValue))) +
geom_point(aes(color=glin.info$threshold),show.legend = T)+
scale_color_manual(values = c("0" = "black", "-1" = "red", "1"="blue"), labels = c("Not significant", "Over-represented", "Under-represented"))+
scale_shape(solid = F)+
geom_hline(aes(yintercept= -log10(0.01), linetype = "p = 0.01"), colour= 'black',linetype="dashed") +
geom_vline(xintercept = 0,color="black")+
xlim(c(-0.07,0.07))+ylim(c(0,10))+
theme(text = element_text(size=10))+
labs(y="-log10(p-val)",
x="log Fold Change",
title = "Negative Binomial Generalized Linear Model of Significant RNAs",
color = "Legend")
plot(glinear_Vplot)
### Heat Map of top hits using Generalized Linear Model (P < 0.05)
top_hits <- rownames(qlf_output_hits$table)[qlf_output_hits$table$PValue<0.05]
heatmap_matrix <- data.matrix(norm_count[which(rownames(norm_count) %in% top_hits),])
# Age specifiers
Age=c(22,23,25,25,26,26,27,28,31,31,31,33,35,37,37,38,38,42,42,45,45,47,
47,51,52,52,52,52,54,57,58,60,62,63,63,67,67,67,69,70,70,72,72,72,
72,72,73,76,80,81,81,81,83)
GLM_HeatMap = heatmap.2(heatmap_matrix, #input CPM data for RNAs
xlab="Age (yr)",
main="Generalized Linear Model",
labCol = Age, #participants' ages labeled on x-axis
Colv = NA, #hide column dendrogram
Rowv = T, #show row dendrogram
scale="row", #values centered and scaled in row direction
dendrogram = "row", #create row dendrogram
labRow = numeric(), #adds y-axis label
col=brewer.pal(4,"YlGnBu"), #specify color scheme and how many colors we'll use from it
trace = "none", #add lines row-wise and/or column-wise to show proportionality of measurement (e.g. if trace="row", row-wise lines with express how large z-scores are for every cell) in this case this option adds too much noise to the heat map
margins = c(5, 2),
key=TRUE, #add key to show z-scores (standard deviation)
key.title= "SD", #add key title
key.xlab="Row Z-Score", #add x-axis label to key
key.ylab = "CPM", #add y-axis label to key
cexCol = 1, #change size of axis labels
density.info = "none", #shows distribution of z-scores (RNA expression levels) of all cells in key (usually normal)
lhei = c(1,4), #adjust height and width of heat map
key.par = list(cex=0.7,mgp=c(1.5, 0.5, 0), mar=c(3, 2.5, 1, 0)), #adjust size of key
distfun = function(heatmap_matrix) as.dist(1-cor(t(heatmap_matrix))), #distance function
hclustfun = function(heatmap_matrix) hclust(heatmap_matrix, method = "complete", members = NULL) #clustering method
)
Heat map representing the expression levels of 2292 significant RNAs with age, found by the Generalized Linear Model under P < 0.05. Row-wise z-scores were calculated and color-coded to display changes in RNA expression with increasing age (22–83 years old). Yellow and blue bars refer to RNAs with very low and very high expression levels respectively, at a particular age.
#################################################################
### LINEAR MODEL with detailed age and sex
norm_count=as.data.frame(cpm(d,log=T))
# Set up P and Beta values
p.val.lin=numeric()
beta.lin=numeric()
count0=numeric()
# Age specifiers (previously defined)
# Sex specifiers: 1 for male, 0 for female
sex=as.factor(c(1,1,1,0,0,0,0,1,1,1,0,1,
1,0,1,1,1,0,1,1,1,1,0,
1,1,0,1,0,1,1,0,1,0,1,0,
1,1,0,0,1,1,1,0,0,1,1,1,1,
0,1,1,1,1))
# Fit the linear model to data
for (i in 1:36084) {
#print(i)
model=lm(t(norm_count[i,])~Age+sex,data=norm_count)
# coefficients are age & sex
p.val.lin[i]=(summary(model)$coefficients[2,4])
beta.lin[i]=summary(model)$coefficients[2,1]
}
head(model)
## $coefficients
## (Intercept) Age sex1
## -2.841607604 0.007816151 -0.553526570
##
## $residuals
## 1 2 3 4 5 6
## 2.06631409 1.39653077 -1.61798828 0.81780961 -0.03470712 0.27609179
## 7 8 9 10 11 12
## -0.98006789 -1.64143673 -1.66488518 1.00227663 1.77915143 2.12709346
## 13 14 15 16 17 18
## 2.14438168 2.30416091 -1.01316059 -0.92730147 -1.71959824 -2.30438942
## 19 20 21 22 23 24
## -1.75086285 -0.35470041 -0.34157551 0.97377896 -2.34347017 -0.40512175
## 25 26 27 28 29 30
## 2.58305699 -0.11958287 0.81761098 0.01120625 -1.84465666 0.59383784
## 31 32 33 34 35 36
## -0.59480849 0.94231775 -2.46071245 -0.63189359 0.53027584 -1.94626663
## 37 38 39 40 41 42
## 2.34896475 -0.28892781 -0.20035314 -0.34858546 -0.85075213 -0.34914146
## 43 44 45 46 47 48
## -0.85435617 2.18421353 0.50998157 0.31884955 -1.99316354 0.14478867
## 49 50 51 52 53
## 2.27846616 0.63733604 2.52202984 0.34326597 -2.07132505
##
## $effects
## (Intercept) Age sex1
## 20.3615195544 0.9056985323 1.9024372525 0.0136142385 -0.8316888542
##
## -0.5208899443 -1.7698359862 -1.8780780129 -1.8798855521 0.7872762662
##
## 1.0182378904 1.9265203696 1.9582358671 1.5865291934 -1.1848791312
##
## -1.0918063702 -1.8841031434 -2.9859529392 -1.8865131956 -0.4687098456
##
## -0.4555849411 0.8741968022 -2.9889655044 -0.4758493513 2.5195430293
##
## -0.7290100113 0.7540970143 -0.5982208920 -1.8937433522 0.5663920668
##
## -1.1609537960 0.9365128930 -2.9980032002 -0.6160575329 0.0001987281
##
## -1.9015760218 2.3936553591 -0.7901503730 -0.6871484238 -0.2822539382
##
## -0.7844206003 -0.2683826570 -1.3195105420 1.7190591608 0.5907403674
##
## 0.3996083541 -1.9051911001 0.2544020283 1.8710208981 0.7830175834
##
## 2.6677113827 0.4889475180 -1.9112162306
##
## $rank
## [1] 3
##
## $fitted.values
## 1 2 3 4 5 6 7 8
## -3.223179 -3.215363 -3.199730 -2.646204 -2.638388 -2.638388 -2.630572 -3.176282
## 9 10 11 12 13 14 15 16
## -3.152833 -3.152833 -2.599307 -3.137201 -3.121569 -2.552410 -3.105937 -3.098120
## 17 18 19 20 21 22 23 24
## -3.098120 -2.513329 -3.066856 -3.043407 -3.043407 -3.027775 -2.474248 -2.996510
## 25 26 27 28 29 30 31 32
## -2.988694 -2.435168 -2.988694 -2.435168 -2.973062 -2.949614 -2.388271 -2.926165
## 33 34 35 36 37 38 39 40
## -2.357006 -2.902717 -2.349190 -2.871452 -2.871452 -2.317925 -2.302293 -2.848004
## 41 42 43 44 45 46 47 48
## -2.848004 -2.832371 -2.278845 -2.278845 -2.832371 -2.832371 -2.824555 -2.801107
## 49 50 51 52 53
## -2.216315 -2.762026 -2.762026 -2.762026 -2.746394
##
## $assign
## [1] 0 1 2
# Genes that passed the threshold p-value < 0.05: 1381
# Genes that passed the threshold p-value < 0.01: 297
# Genes up-regulated/over-represented with age, under p-value < 0.01: 240
# Genes down-regulated/under-represented with age, under p-value < 0.01: 57
# Genes that passed the BH multiple testing correction: 0 :(
### LINEAR MODEL OUPUT REPORT:
# Note: here we selected a conservative p-value threshold of P < 0.01
# A reminder about error. Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis.
### 297 genes have been identified as significantly expressed: of which 57 are under-represented (P < 0.01 , negative β values for age) and 240 are over-represented (P < 0.01, positive β values for age)
neg_betas= which(beta.lin < 0)
pos_betas= which(beta.lin > 0)
# 1318 genes that passed the threshold p-value < 0.05
lm_total_0.05 <- rownames(norm_count)[which(p.val.lin<0.05)]
# 297 genes that passed the threshold p-value < 0.01
lm_total <- rownames(norm_count)[which(p.val.lin<0.01)]
# Under p-value < 0.01, 240 genes are over-expressed
lm_over <- rownames(norm_count)[which(p.val.lin[pos_betas]<0.01)]
# Under p-value < 0.01, 57 genes are under-expressed
lm_under <- rownames(norm_count)[which(p.val.lin[neg_betas]<0.01)]
### False Discovery Rate (FDR) p-Value Correction:
adjp=p.adjust(p.val.lin,method="BH",n=length(p.val.lin))
# Same as for Model I, Benjamini-Hochberg (aka "fdr") method is used here to control the false discovery rate.
plot(beta.lin,-log10(adjp),xlab="Beta Value", ylab="-log10(adj p-val)",
main="FDR Adjusted p-Values from Linear Regressions")
legend("topleft",c("p=0.05","p=0.01"),col=c("red","blue"),
lty=1:1,cex=0.8)
abline(1.3,0,col="red")
abline(2,0,col="blue")
abline(v=0,col="black")
Yikes no genes past the multiple testing correction…
Which means this linear model does NOT fit our data as well as the previous model.
However …we can still try to visualize it:
### LINEAR Model volcano plot
lin.info <- data.frame(p.val.lin,beta.lin)
# head(lin.info)
lin.info$threshold <- factor(ifelse(lin.info$p.val.lin<0.01 & lin.info$beta.lin < 0, 1,
ifelse(lin.info$p.val.lin<0.01 & lin.info$beta.lin > 0,-1,0)))
options(scipen=999) # turn-off scientific notation like 1e+48
linear_Vplot <- ggplot(lin.info, aes(x=beta.lin, y=-log10(p.val.lin))) +
geom_point(aes(color=lin.info$threshold),show.legend = T)+
scale_color_manual(values = c("0" = "black", "-1" = "red", "1"="blue"), labels = c("Not significant", "Over-represented", "Under-represented"))+
scale_shape(solid = F)+
geom_hline(aes(yintercept= -log10(0.01), linetype = "p = 0.01"), colour= 'black',linetype="dashed") +
geom_vline(xintercept = 0,color="black")+
xlim(c(-0.07,0.07))+ylim(c(0,10))+
theme(text = element_text(size=10))+
labs(y="-log10(p-val)",
x="Beta Coefficient",
title = "Linear Model of Significant RNAs",
color = "Legend")
plot(linear_Vplot)
# Volcano plot of the linear model output is surpisingly nice and clean?
# Alternative graphic summaries of data from the edgeR package: MA plot
# plotSmear(est_d)
#Assign/name CPM data
norm_count_heatmap = data.matrix(norm_count[which(p.val.lin<0.01), ])
# Ah the struggle with graph margins returns...
LM_HeatMap = heatmap.2(norm_count_heatmap, #input CPM data for RNAs
xlab="Age (yr)",
main="Linear Model",
labCol = Age, #participants' ages labeled on x-axis
Colv = NA, #hide column dendrogram
Rowv = T, #show row dendrogram
scale="row", #values centered and scaled in row direction
dendrogram = "row", #create row dendrogram
labRow = numeric(), #adds y-axis label
col=brewer.pal(4,"YlGnBu"), #specify color scheme and how many colors we'll use from it
trace = "none", #add lines row-wise and/or column-wise to show proportionality of measurement (e.g. if trace="row", row-wise lines with express how large z-scores are for every cell) in this case this option adds too much noise to the heat map
margins = c(5, 2),
key=TRUE, #add key to show z-scores (standard deviation)
key.title= "SD", #add key title
key.xlab="Row Z-Score", #add x-axis label to key
key.ylab = "CPM", #add y-axis label to key
cexCol = 1, #change size of axis labels
density.info = "none", #shows distribution of z-scores (RNA expression levels) of all cells in key (usually normal)
lhei = c(1,4), #adjust height and width of heat map
key.par = list(cex=0.7,mgp=c(1.5, 0.5, 0), mar=c(3, 2.5, 1, 0)), #adjust size of key
distfun = function(norm_count_heatmap) as.dist(1-cor(t(norm_count_heatmap))), #distance function
hclustfun = function(norm_count_heatmap) hclust(norm_count_heatmap, method = "complete", members = NULL) #clustering method
)
Heat map representing the expression levels of 2292 significant RNAs with age, found by the Generalized Linear Model under P < 0.05. Row-wise z-scores were calculated and color-coded to display changes in RNA expression with increasing age (22–83 years old). Yellow and blue bars refer to RNAs with very low and very high expression levels respectively, at a particular age. Additionally, a correlation-based distance method was used to cluster the RNAs obtained from the linear model with positive and negative β-values for age. RNAs with positive β-values are shown in the top of the heat map, containing RNAs with low to high expression (yellow to blue rows) with increasing age. Likewise, RNAs with negative β-values are shown on the bottom of the heat map, containing RNAs with high to low expression (blue to yellow rows) with increasing age.
tippytop.p.val.glm=which(qlf_output_hits$table$PValue<0.01) #899
top.p.val.glm=which(qlf_output_hits$table$PValue<0.05) #2292
glm_corrected <- which(qlf_output_hits$table$FDR < 0.05) #397 :)
tippytop.p.val.lin=which(p.val.lin<0.01) #297
top.p.val.lin=which(p.val.lin<0.05) #1318
lm_corrected <- which(adjp<0.05) # sadly none..
# Side Note: there's 73 genes overlapping from the two models at PValue < 0.05, these genes may be worth looking at
model_intersect_genes = norm_count[intersect(which(p.val.lin<0.05),which(qlf_output_hits$table$PValue<0.05)),] # their gene names and p_values are stored here
write.table(rownames(model_intersect_genes),
file = file.path("/Users/apple/bcb420_code/DE_data","model_intersect_genes.txt"), sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Model I: Genewise Negative Binomial Generalized Linear Model
Number of genes that fall below threshold p-value of 0.05: 2292
Number of genes that fall below threshold p-value of 0.01: 899
Number of genes that passed FDR correction: 397
Model II: Linear Model
Number of genes that fall below threshold p-value of 0.05: 1318
Number of genes that fall below threshold p-value of 0.01: 297
Number of genes that passed FDR correction: 0
This section will use the gene list produced by the previous differential expression analysis to perform over-representation analysis (ORA). This will take into account the list of either up-regulated or down-regulated (or both) gene IDs, and test whether a gene set contains disproportionately many genes of significant expression change.
*Note: the difference between ORA and ‘gene set enrichment analysis’ (GSEA) is their inclusion level. ORA reduces the full gene expression matrix to those passing a pre-determined threshold prior to analysis while GSEA first compute “differential expression scores” for ALL genes measured before evaluating their association with gene sets / biological pathways.
KEGG[3],a manually curated source of high-level biochemical functions and pathways. g:Profiler claims to update external databases like KEGG, Reactome, and Wikipathways during the same update routine straight from the data source. The latest version of KEGG is released on Jan 1, 2022.
Reactome[4], another biological pathway database that covers the same number of genes as KEGG. However, KEGG covers more broad term pathways while Reactome contains more detailed entries. The latest version of Reactome is released on Dec 13, 2021.
WikiPathways[5], an alternative pathway database. The latest version of WikiPathway is released on Feb 10, 2022.
Note: we are using all three pathway databases available on g:Profiler to better cross-examine the output gene set terms identified by them. For example, semantically analogous pathways (e.g. “Notch signaling pathway” in KEGG and “Signaling by NOTCH” pathway in Reactome) across multiple databases are likely better descriptors of our gene lists than pathways found by only one source.
Length of input gene list = 528
Term size limit = 20 ~ 1000 genes (given that out RNA-seq data is not disease-oriented but extracted from healthy participants, we are more interested in broader pathways that are highlighted in muscle aging)
Under the BH-FDR corrected significance threshold of 0.05, and a term size limit of 20~500:
#Gene sets returned by GO:BP = 66
#Gene sets returned by KEGG = 14
#Gene sets returned by REAC = 46
#Gene sets returned by WP = 6
Output across all databases:
Notable Pathways (edited results, not entire output image): Interpretation:
For genes that are up-regulated with age, the top term identified across multiple databases is “cytoplasmic ribosomal protein translation”. This makes sense as aging is closely associated with the proteome, the gradual accumulation of mutations that arise in the aging process will lead to impaired protein synthesis. Another pathway tightly linked to aging is “oxidative phosphorylation”. During the aging process, the efficiency of oxidative phosphorylation tend to decline as a result of mitochondrial dysfunctions. Oxidative phosphorylation is also identified as a age-related pathway by authors of the study. The KEGG database is used by the original publication, from which they identified age-related sarcopenia (muscle wasting) and several other age-related neuromuscular diseases. However, according our results, the terms identified by KEGG does not seem to correlate perfectly with the authors’ findings. Several clearly non-related diseases were identified (e.g. COVID-19). This may be due to recent updates in the database that included novel gene-process associations for prominent diseases. In addition, the authors had mentioned that the top positively correlated protein–mRNA pairs were found involved in glycolysis/gluconeogenesis using KEGG databse. This is not reflected in our analysis.
Length of input gene list = 321
Term size limit = 20 ~ 2000 genes (set to larger terms because we want to highlight some of the broader, more generalized pathways associated with muscle aging e.g. tissue development)
Under the BH-FDR corrected significance threshold of 0.05:
#Gene sets returned by GO:BP = 266
#Gene sets returned by KEGG = 7
#Gene sets returned by REAC = 16
#Gene sets returned by WP = 9
Output across all databases:
Notable Pathways (edited results, not entire output image): Interpretation:
For genes that are down-regulated with age, large biological processes such as “tissue development” and “regulation of anatomical structure morphogenesis” are the top results returned by GO:BP. This makes sense as an organism’s ability to generate new tissues, as well as the maintenance of structural orientation and signalling naturally deteriorate with age. The same reasoning explains the decline “Collagen formation” and similar biosynthesis pathways identified by the REACTOME database. “Wnt signalling” is another important age-related pathway, it has long be established in literature that the impaired regenerative potential of skeletal muscle with aging is associated with the canonical Wnt signaling regulators, the lack of which will lead to an increase in tissue fibrosis. This finding was discussed in a 2007 paper published in Science[6]. Many pathways (ECM organization; AGE-RAGE signaling.. etc) identified by this analysis are also reflected in the original study.
Length of input gene list = 899
Term size limit = 20 ~ 1000 genes
Under the BH-FDR corrected significance threshold of 0.05, and a term size limit of 20~500:
#Gene sets returned by GO:BP = 58
#Gene sets returned by KEGG = 15
#Gene sets returned by REAC = 52
#Gene sets returned by WP = 5
Output across all databases:
Notable Pathways (edited results, not entire output image): Interpretation:
Compared to ORA performed using lists of up and down-regulated gene separately, the combined gene list returned a combination of terms from both analysis, which is expected. From this result, we can see that the strength/significance of top pathways associated with up-regulated genes seem to slightly trump the top terms associated with down-regulated genes. However, these results are difficult to interpret as they do not reveal the direction of association.
Table of Notable Genes: here I have highlighted three over-expressed genes with aging and three under-expressed genes with aging. These RNAs are amongt the top 20 ranked in their respective gene lists. The function and characteristics of these genes align with our expectation of muscle aging. For example, angiotensin-converting enzyme 2 (ACE2), an emzyme that protects agains aging-associated muscle wasting, is down-regulated with age. This means that as individuals age, they tend to be more prone to muscle wasting diseases.
| Expression with age | Common Name (RNA Ensembl ID) | Padj | Protein Coding | Associated Functions & Characteristics |
|---|---|---|---|---|
| up | XIST(ENSG00000229807) | 4.35E-08 | no | Age-related X chromosome inactivation, can clinically result in late-onset X-linked disoders (e.g. muscular dystrophy) |
| up | C12orf75(ENSG00000235162) | 3.05E-06 | yes | Energy metabolism, insulin signalling, aging, skeletal muscle weakness |
| up | FAM83B(ENSG00000168143) | 1.99E-05 | yes | Hypoxia response pathway, aging, musclar dystrophy |
| down | ACE2(ENSG00000130234) | 4.35E-04 | yes | Protection against age-associated muscle wasting (e.g. DND) |
| down | NPNT(ENSG00000168743) | 5.18E-03 | yes | Muscle cell niche maintenance, tissue injury repair and regeneration |
| down | CNBP(ENSG00000168714) | 9.98E-03 | yes | Juvenile tissue development, abudant in the heart and skeletal muscles |